IDa-Det: An Information Discrepancy-Aware Distillation for 1-bit Detectors


[Figure 6.16 image: proposal pairs (R^t_n, R^s_n) formed between the two models. Labels: R in Teacher, Paired R in Student, R in Student, Paired R in Teacher.]

FIGURE 6.16

Illustration of how the proposal pairs are generated. Each proposal in one model generates a counterpart feature-map patch at the same location in the other model.

channel-wise proposal feature and measure the discrepancy as

$$
\varepsilon_n = \sum_{c=1}^{C} \left\| (R^t_{n;c} - R^s_{n;c})^{T} \, \Sigma_{n;c}^{-1} \, (R^t_{n;c} - R^s_{n;c}) \right\|_2, \tag{6.83}
$$

where $\Sigma_{n;c}$ denotes the covariance matrix of the teacher and the student in the $c$-th channel of the $n$-th proposal pair. The Mahalanobis distance accounts for both the pixel-level distance between the proposals and the differences in the statistical characteristics of the two proposals in a pair.
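As a rough illustration of Eq. (6.83), the sketch below computes the discrepancy of one proposal pair in plain Python. It makes a simplifying assumption that is not in the text: the covariance of each channel is taken to be diagonal, so the inverse covariance reduces to a division by per-pixel variances. The function names, the flattened list layout, and the toy numbers are all hypothetical; a full covariance matrix would require an explicit matrix inverse instead.

```python
def channel_discrepancy(r_t, r_s, var):
    """Quadratic form d^T diag(var)^{-1} d for one flattened channel.

    Assumption: diagonal covariance, given as a list of per-pixel variances.
    """
    return sum((t - s) ** 2 / v for t, s, v in zip(r_t, r_s, var))


def pair_discrepancy(teacher, student, variances):
    """Sum the per-channel discrepancies of one proposal pair, cf. Eq. (6.83).

    Each argument is a list of channels; each channel is a flat list of floats.
    The quadratic form is a scalar, so its 2-norm is its absolute value.
    """
    return sum(
        abs(channel_discrepancy(t, s, v))
        for t, s, v in zip(teacher, student, variances)
    )


# Toy proposal pair with C = 2 channels of 2 pixels each (made-up values).
teacher = [[1.0, 2.0], [0.0, 1.0]]
student = [[1.0, 0.0], [1.0, 1.0]]
variances = [[1.0, 4.0], [0.25, 1.0]]

eps = pair_discrepancy(teacher, student, variances)
print(eps)  # channel 0 contributes 1.0, channel 1 contributes 4.0
```

Dividing by the variance is what separates this from a plain squared Euclidean distance: the same pixel difference counts for more in a low-variance position than in a high-variance one.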

To select representative proposals with maximum information discrepancy, we first define a binary distillation mask $m_n$ as

$$
m_n =
\begin{cases}
1, & \text{if pair } (R^t_n, R^s_n) \text{ is selected,} \\
0, & \text{otherwise,}
\end{cases} \tag{6.84}
$$

where $m_n = 1$ denotes that distillation is applied to this proposal pair; otherwise, the pair is left unchanged. The student can learn from its teacher counterpart only when the distributions of the two proposals in a pair differ substantially, and these are the pairs for which distillation is needed.
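One simple way to build such a mask, sketched here under stated assumptions, is to rank the pairs by their discrepancy and keep a fixed fraction of the most discrepant ones. The function name `select_mask`, the truncating rounding rule, and the toy values are hypothetical; `gamma` stands for the fraction of pairs to keep.

```python
def select_mask(eps, gamma):
    """Return a 0/1 mask that marks the most discrepant proposal pairs.

    eps   : list of per-pair discrepancies
    gamma : fraction of pairs to keep, in [0, 1]
    Assumption: the number of kept pairs is truncated to an integer.
    """
    k = int(gamma * len(eps))
    # Indices of the k largest discrepancies.
    top = sorted(range(len(eps)), key=lambda n: eps[n], reverse=True)[:k]
    mask = [0] * len(eps)
    for n in top:
        mask[n] = 1
    return mask


# Four proposal pairs with made-up discrepancies; keep half of them.
discrepancies = [0.2, 3.1, 0.7, 1.5]
mask = select_mask(discrepancies, gamma=0.5)
print(mask)  # [0, 1, 0, 1]
```

Because every entry of the mask is 0 or 1 and the number of ones is fixed, choosing the largest discrepancies maximizes the masked sum of discrepancies.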

Based on the derivation above, discrepant proposal pairs are optimized through distillation. To distill the selected pairs, we maximize the conditional probability $p(R^s_n \mid R^t_n)$, so that after distillation the feature distributions of the teacher proposals and their student counterparts become similar. To this end, we define $p(R^s_n \mid R^t_n)$, taking $m_n$ with $n \in \{1, \cdots, N_T + N_S\}$ into account, as

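$$
p(R^s_n \mid R^t_n; m_n) \sim m_n \, \mathcal{N}\!\left(\mu^t_n, {\sigma^t_n}^2\right) + (1 - m_n) \, \mathcal{N}\!\left(\mu^s_n, {\sigma^s_n}^2\right). \tag{6.85}
$$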
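Because the mask is binary, Eq. (6.85) acts as a switch: a selected pair is modeled by the teacher's Gaussian, and an unselected pair keeps the student's own. The sketch below illustrates this for a single scalar feature value, which is a simplification made here for readability; the function names and the numbers are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def target_density(x, m, mu_t, sigma_t, mu_s, sigma_s):
    """Masked mixture of Eq. (6.85) evaluated at a scalar feature value x.

    m = 1 selects the teacher Gaussian, m = 0 the student Gaussian.
    """
    return m * gaussian_pdf(x, mu_t, sigma_t) + (1 - m) * gaussian_pdf(x, mu_s, sigma_s)


# Made-up statistics for one proposal pair.
mu_t, sigma_t = 0.0, 1.0   # teacher
mu_s, sigma_s = 5.0, 2.0   # student

selected = target_density(0.0, 1, mu_t, sigma_t, mu_s, sigma_s)
unselected = target_density(0.0, 0, mu_t, sigma_t, mu_s, sigma_s)
print(selected, unselected)
```

With `m = 1` the value equals the teacher density at `x`, so maximizing it pulls the student feature toward the teacher statistics; with `m = 0` the teacher terms drop out entirely.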

Subsequently, we introduce a bilevel optimization formulation to solve the distillation problem as

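$$
\begin{aligned}
&\max_{R^s_n} \; p(R^s_n \mid R^t_n; m^*), \quad \forall\, n \in \{1, \cdots, N_T + N_S\}, \\
&\text{s.t.} \quad m^* = \arg\max_{m} \sum_{n=1}^{N_T + N_S} m_n \varepsilon_n,
\end{aligned} \tag{6.86}
$$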

where $m = [m_1, \cdots, m_{N_T + N_S}]$ and $\|m\|_0 = \gamma \cdot (N_T + N_S)$, with $\gamma$ a hyperparameter. In this way, we select $\gamma \cdot (N_T + N_S)$ pairs of proposals that contain the most representative